Laurel Brehm
MEM = a mixed effect model. That means it’s a model that has fixed and random effects.
Anything you can analyse with a repeated measures ANOVA, you can analyse with MEM. In addition, MEM allows crossed random effects, both categorical AND continuous predictors, and is good for unbalanced data sets. It’s a very versatile tool.
We’ll start out with a simple model that has a continuous dependant measure (reaction time; RT) and a single continous predictor (raw frequency). In the experimental design, data were collected from many people (subjects) who observed many items (words). Each person saw each word and made a decision about it. In addition, each word is associated with one frequency level.
In model terms: Predict the dependent measure RT from the fixed effect of raw word frequency, plus random intercepts for subjects and items.
Fixed effects are things that are measured/manipulated on purpose: things you predicted could have an effect on the dependent measure. These are observations not drawn at random, so their effects are allowed to vary away from zero. Frequency is a fixed effect because we manipulated it in the experiment. So is Native Language. Start by making a plot of these:
Random effects are things that are drawn at random (hence the name). If you have a ‘repeated measures’ design, the repeated measures will almost always be your random intercepts. In the model, their average effect will always be zero, but they may capture variance. Each person might respond a little differently, and each word might be a little different, even if it has the same level of frequency.
Random slopes are like a slope term in the fixed effect portion of the model (a predictor), but instead, we assess whether they should vary by the group that defines the random intercept.
Make plots of these as well:
ggplot(lexdec,aes(x=Frequency, y=RT,color=Subject)) +
geom_point() + geom_smooth(method='lm') + theme_bw() + scale_color_viridis_d()ggplot(lexdec,aes(x=Frequency, y=RT,color=Word,shape=NativeLanguage)) +
geom_point() + geom_smooth(aes(lty=NativeLanguage),color='black',method='lm') +
theme_bw() + scale_color_viridis_d()Now put these pieces together for a model. It’s a similar syntax to lm, but we’re adding an ‘er’ to the end, and a piece for the random terms:
model output <- lmer ( DV ~ 1 + Fixed Effects + (1 + Random Slope | Random Intercept), data=Data)
We will also include an optional argument today, REML = F. This means “restricted maximum likeihood” is NOT the methods we’ll use– we will use maximum likelihood. This is useful for model comparisons, which is something we will do today to evaluate the effect of frequency.
The 1 + before the fixed effects is actually also optional. It means ‘give me an intercept’, but the model will by default, give you an intercept.
Here is the maximal model involving frequency and native language that is justified by our data: include the interaction between them (as a fixed effect), plus a random intercept for Subject and Word, and a random slope for Frequency by Subject, and a random slope of Native Language by Word.
Think: why is this random effect structure the maximal one justified?
contrasts(lexdec$NativeLanguage) <- c(-.5, .5)
mem1 <- lmer ( RT ~ 1 + Frequency*NativeLanguage + (1 + Frequency | Subject) + (1 + NativeLanguage | Word), data=lexdec, REML=F)## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.0178724
## (tol = 0.002, component 1)
This model fails to converge. Looking at the random effects only (or Variance components and Correlations), we can see why: the random slope for NativeLanguage is highly correlated with the random intercept for Word. Take it out.
(Often models will also not converge if there is an effect that accounts for almost no variance– there will sometimes be a warning about Singular Fit in this case. It’s also the case that as best practice, you should always take out higher-order terms that account for little variance, starting with any interactions in random slopes.)
## Groups Name Std.Dev. Corr
## Word (Intercept) 0.056464
## NativeLanguage1 0.034831 0.930
## Subject (Intercept) 0.183642
## Frequency 0.014381 -0.864
## Residual 0.169873
This model fits better!
mem2 <- lmer ( RT ~ 1 + Frequency*NativeLanguage + (1 + Frequency | Subject) + (1 | Word), data=lexdec, REML=F)
summary(mem2)## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula:
## RT ~ 1 + Frequency * NativeLanguage + (1 + Frequency | Subject) +
## (1 | Word)
## Data: lexdec
##
## AIC BIC logLik deviance df.resid
## -961.2 -912.5 489.6 -979.2 1650
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4091 -0.6154 -0.1101 0.4690 6.1777
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Word (Intercept) 0.002916 0.05400
## Subject (Intercept) 0.033702 0.18358
## Frequency 0.000204 0.01428 -0.87
## Residual 0.029173 0.17080
## Number of obs: 1659, groups: Word, 79; Subject, 21
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 6.609232 0.049552 133.378
## Frequency -0.044834 0.006610 -6.783
## NativeLanguage1 0.286343 0.087308 3.280
## Frequency:NativeLanguage1 -0.027472 0.009158 -3.000
##
## Correlation of Fixed Effects:
## (Intr) Frqncy NtvLn1
## Frequency -0.827
## NativeLngg1 0.126 -0.081
## Frqncy:NtL1 -0.103 0.099 -0.815
To interpret what the model means, look at the fixed effects– estimates, the error around them, and the t-value (ratio of estimate to error). Frequency, Native Language, and their interaction all have reliably large effects compared to the error of the estimate.
The overall intercept reflects the estimated (log) RT value when predictors are zero– since we did effects coding for Native Language, this would reflect the RT we’d expect for a word with zero log Frequency averaged across both Native Language conditions.
The effects here are main effects: the change in RT with a one-unit change in log Frequency or Native Language (the difference in average RT going from English to Other).
The interaction reflects that the different levels of Native Language have different slopes by Frequency: the Native Language = Other group has a steeper slope, because the negative contrast is for Other, and the estimate is negative. (We also know this by looking at the plot)
We also see info about the random effects. There is no estimate term for them: the estimate is defined as zero for random effects. But there is a variance term for each– each of our random effects is a group of points within the data that is systematically variable from the cell that the data belong to. There are also correlations between slopes and intercepts fit to the same group of data (here: subject)
mem2 <- lmer ( RT ~ 1 + Frequency*NativeLanguage + (1 + Frequency | Subject) + (1 | Word), data=lexdec, REML=F)
summary(mem2)## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula:
## RT ~ 1 + Frequency * NativeLanguage + (1 + Frequency | Subject) +
## (1 | Word)
## Data: lexdec
##
## AIC BIC logLik deviance df.resid
## -961.2 -912.5 489.6 -979.2 1650
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4091 -0.6154 -0.1101 0.4690 6.1777
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Word (Intercept) 0.002916 0.05400
## Subject (Intercept) 0.033702 0.18358
## Frequency 0.000204 0.01428 -0.87
## Residual 0.029173 0.17080
## Number of obs: 1659, groups: Word, 79; Subject, 21
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 6.609232 0.049552 133.378
## Frequency -0.044834 0.006610 -6.783
## NativeLanguage1 0.286343 0.087308 3.280
## Frequency:NativeLanguage1 -0.027472 0.009158 -3.000
##
## Correlation of Fixed Effects:
## (Intr) Frqncy NtvLn1
## Frequency -0.827
## NativeLngg1 0.126 -0.081
## Frqncy:NtL1 -0.103 0.099 -0.815
Recall the same plots we made for the fixed-effect only regression: we can make them again here to diagnose model fit. Note that the syntax for the first model is different: we can take a shortcut to extract the data for lmer objects.
These plots indicate that this model is a lot better. Adding the random effects (and the extra fixed effects) has substantially improved model fit.
There are no p-values associated with lmer models. This is because there is no closed-form solution to the model– all the random effects and fixed effects are modeled at once, meaning there is no well-defined notion of a ‘degree of freedom’, which is the missing piece to define p-values.
That’s ok, because you don’t need p-values.
Really.
All you need really is to know that the ratio of effect to error is reasonably large. This means you can decide that t>2 is sufficient evidence for your hypothesis (2x as much estimate as error). You can cite Baayen 2008 (the R book) as precedent.
Here’s another nice option: get a confidence interval on your parameters. If the 95% CI doesn’t cross zero, the parameter is reliably significant.
The ‘sig’ terms are the random effects: slopes, intercepts, correlations, and residuals.
(There is an error here in that the sig03 term– the correlation between random intercept for Subject and the random slope for Frequency by Subject– is reaching a boundary of -1. In this case, it doesn’t matter because I don’t want to draw inferences around the random effects.)
## Computing profile confidence intervals ...
## Warning in FUN(X[[i]], ...): non-monotonic profile for .sig03
## Warning in confint.thpr(pp, level = level, zeta = zeta): bad spline fit
## for .sig03: falling back to linear interpolation
## 2.5 % 97.5 %
## .sig01 0.042439244 0.067877578
## .sig02 0.130819932 0.267070018
## .sig03 -1.000000000 -0.511665712
## .sig04 0.005024811 0.024905203
## .sigma 0.164939990 0.177015916
## (Intercept) 6.509144374 6.709318895
## Frequency -0.058025196 -0.031642958
## NativeLanguage1 0.107083623 0.465601438
## Frequency:NativeLanguage1 -0.046281746 -0.008662151
But if you do decide you want them, here’s a way to get them: But, but, but….I want p-values! Ok, here’s a way of thinking about it that’s better defined for modeling: how much does including the factor improve my model? We can test this using model comparison. This work flow does a model comparison that is equivalent to using the type-III sum of squares in an ANOVA: drop out one factor at a time and test what it does.
This will only work for predictors that are coded as numerics– here, a dummy coded version of Native Language. So, re-run everything… using a numeric version of native language, coded with -.5, .5.
Because of convergence troubles with some of these models, I’ve also had to simplify the random effect structure.
lexdec$NL2 <- as.numeric(lexdec$NativeLanguage) - 1.5
mem02 <- lmer ( RT ~ 1 + Frequency*NL2 + (1 | Subject) + (1 | Word), data=lexdec, REML=F)
mem20 <- lmer ( RT ~ 0 + Frequency*NL2 + (1 | Subject) + (1 | Word), data=lexdec, REML=F)
mem21 <- lmer ( RT ~ 1 + NL2+Frequency:NL2 + (1 | Subject) + (1 | Word), data=lexdec, REML=F)
mem22 <- lmer ( RT ~ 1 + Frequency+Frequency:NL2 + (1 | Subject) + (1 | Word), data=lexdec, REML=F)
mem23 <- lmer ( RT ~ 1 + Frequency+NL2 + (1 | Subject) + (1 | Word), data=lexdec, REML=F)Next, we compare these models with a series of chi-squared tests… which you call using the command “anova”. Note the last column: this is the p-value associated with the model comparison that leaves the term out. This comes from a chi-squared test on the likelihood ratio of the two models using a chi-squared function with one degree of freedom (one parameter is different). The p-value for this number is very small, meaning that adding frequency relably improves model fit more than would be expected due to chance.
## Data: lexdec
## Models:
## mem20: RT ~ 0 + Frequency * NL2 + (1 | Subject) + (1 | Word)
## mem02: RT ~ 1 + Frequency * NL2 + (1 | Subject) + (1 | Word)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mem20 6 -442.78 -410.30 227.39 -454.78
## mem02 7 -954.47 -916.57 484.24 -968.47 513.69 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: lexdec
## Models:
## mem21: RT ~ 1 + NL2 + Frequency:NL2 + (1 | Subject) + (1 | Word)
## mem02: RT ~ 1 + Frequency * NL2 + (1 | Subject) + (1 | Word)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mem21 6 -912.07 -879.59 462.03 -924.07
## mem02 7 -954.47 -916.57 484.24 -968.47 44.403 1 2.673e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: lexdec
## Models:
## mem22: RT ~ 1 + Frequency + Frequency:NL2 + (1 | Subject) + (1 | Word)
## mem02: RT ~ 1 + Frequency * NL2 + (1 | Subject) + (1 | Word)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mem22 6 -941.75 -909.26 476.87 -953.75
## mem02 7 -954.47 -916.57 484.24 -968.47 14.726 1 0.0001243 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: lexdec
## Models:
## mem23: RT ~ 1 + Frequency + NL2 + (1 | Subject) + (1 | Word)
## mem02: RT ~ 1 + Frequency * NL2 + (1 | Subject) + (1 | Word)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mem23 6 -939.69 -907.20 475.84 -951.69
## mem02 7 -954.47 -916.57 484.24 -968.47 16.784 1 4.189e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: there are other ways of getting p-values.
You can use a Satterthwaite approximation to get degrees of freedom for your t distribution, and get p-values that way (https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf).
Or you can do a type-II model comparison (remove all higher-order terms that effect participates in).
(Or you can just…not do it, seriously, it’s ok.)
Mixed models also work for GLMs– next, let’s assess error rates (correct, incorrect) instead of RTs! Note that we don’t have to re-set contrasts for NativeLanguage because we did that above.
This is very straightforward, using a similar syntax as above, but with a ‘g’ and family=‘binomial’ call (as with glm).
I started out with a somewhat complex random effect structure– the largest justified by the design of the study– and simplified due to non-convergence.
## this model doesn't converge -- remove NativeLanguage by Word slope because it (1) accounts for very little variance, and (2) is perfectly correlated with the Word intercept.
##glmer1 <- glmer(Correct ~ Frequency*NativeLanguage + (1 + Frequency | Subject) + (1 + NativeLanguage | Word), data=lexdec, family='binomial')
## this model converges but is still overfitted-- the Frequency by Subject slope is still highly correlated with Subject intercept, so remove.
##glmer1 <- glmer(Correct ~ Frequency*NativeLanguage + (1 + Frequency | Subject) + (1 | Word), data=lexdec, family='binomial')
glmer1 <- glmer(Correct ~ Frequency*NativeLanguage + (1 | Subject) + (1 | Word), data=lexdec, family='binomial')After removing some random slopes (common for glmer models), we have a fitted model.
We can see that because the intercept is negative, the 0 value (by default, again the first alphabetically= correct) is more likely.
Frequency has a significant effect on likelihood of incorrect responses– higher frequency is associated with more correctness, as indicated by the negative value.
Because this is a glmer, we also use a (Wald) z-value to compare estimate to error, and then, we get a p-value associated with our fixed effect terms. This ype of p-value is ‘anti-conservative’ (too liberal) but an ok approximation that we get for free.
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: Correct ~ Frequency * NativeLanguage + (1 | Subject) + (1 | Word)
## Data: lexdec
##
## AIC BIC logLik deviance df.resid
## 497.1 529.6 -242.6 485.1 1653
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.0664 -0.1801 -0.1237 -0.0882 8.3313
##
## Random effects:
## Groups Name Variance Std.Dev.
## Word (Intercept) 1.0404 1.0200
## Subject (Intercept) 0.6578 0.8111
## Number of obs: 1659, groups: Word, 79; Subject, 21
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5123 0.7364 -2.054 0.039998 *
## Frequency -0.5524 0.1555 -3.553 0.000381 ***
## NativeLanguage1 1.8231 0.9919 1.838 0.066073 .
## Frequency:NativeLanguage1 -0.3119 0.2175 -1.434 0.151654
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Frqncy NtvLn1
## Frequency -0.889
## NativeLngg1 -0.036 0.015
## Frqncy:NtL1 0.027 0.011 -0.882
To assess model fit, we can again plot the distribution of residuals and the qqplot. They will tend to be less pretty than you’d see in a linear model, which relates to the shape of the log odds distribution we are comparing to (steep slope in the middle, shallow on the sides), and the different meaning that ‘residual’ takes here– these reflect not some error term in the model, but the model predicting “correct” when it should have predicted “incorrect”.
But, still informative… these show that we have some points with very positive values that we aren’t accounting for as well. These are unexplaned by our model at present!
One last set of plots to make… what does the effect look like from our model? For glmer models (and for complicated lmer models), I like to extract the effects using the ‘effects’ package
## NOTE: Frequency is not a high-order term in the model
More complex interaction terms are something that is often new to people when moving away from ANOVA.
This is pretty straightforward in (g)lm and (g)lmer models– in fact, you don’t have to do post-hoc comparisons.
First tip: Interpreting multi-way interactions is really just like intepreting two-way interactions, scaled up.
A two-way interaction effect is a difference of differences (a difference in how the two levels of one variable are different on another variable)
A three-way interaction effect is a difference of differences of differences (a different two-way interaction across levels of a different variable)
-> As a heuristic, if there is an interaction effect– there is often just one point that is different from what you expect. Plot your numbers, and find it. This will help guide you in reading your output.
Second tip: But, interpretations of main effects depend a lot on contrasts.
This is because the main effect of variable A is evaluated at whatever zero reflects for variable B.
For the highest-order interaction, it will be the same regardless of what contrasts you put in. This is because it reflects a difference of differences.
Third tip: There will be MORE CONTRASTS for variables with more levels– but you can use this to your advantage.
I am basing this off of the cake data in lme4. This was a food science experiment where some scientists made snack cakes (like Twinkies) and broke them in half to see if the new recipe had the properties of the original: here, breakage angle, which is a proxy for denseness/moistness of cake.
I took the numbers from the original data and re-combined, and resampled, to make a useful teaching example. Details below…
Our DV is ‘angle’ – how breakable the cakes are. I added to this to make an interaction appear in the data.
‘time’ reflects two time points at which the cakes were baked (to get this I split the replicates in half. let’s say that this reflects that the original baker made cakes on Monday, and on Wednesday). A 2 level categorical variable.
‘recipe’ is the original recipe factor. A 3 level categorical variable.
‘temperature’ is a 2 level categorical variable (the highest and lowest of the original temperatures).
‘batch’ is a combination of the original variables of replicate and recipe. This will be our grouping variable.
## setting up the data
cake$time <- ifelse(as.numeric(cake$replicate)<8,'t1','t2')
cake$time <- as.factor(cake$time)
cake$batch <- as.factor(paste0(cake$recipe,cake$replicate))
cake2 <- cake %>% filter(temp==175 | temp==225)
cake2 <- droplevels(cake2)
cake2[(cake2$recipe=='C' & cake2$time=='t2' & cake2$temp=='225'),]$angle <- cake2[(cake2$recipe=='C' & cake2$time=='t2' & cake2$temp=='225'),]$angle + 20I am going to set up effects contrasts for temperature and time (compare time 1 to time 2; compare temp 1 to temp 2; make the effects of each other variable reflect main effects).
For batch, I’ll use a coding scheme called Helmert, where we compare (Batch A and B) to batch C in one contrast, and compare A and B in the other.
Because of these coding schemes (zero is in the middle), this means that the main effects in this model are actually main effects.
## setting up the contrasts
contrasts(cake2$time) <- c(-1,1)
contrasts(cake2$temperature) <- c(-1,1)
contrasts(cake2$recipe) <- contr.helmert
contrasts(cake2$time)## [,1]
## t1 -1
## t2 1
## [,1]
## 175 -1
## 225 1
## [,1] [,2]
## A -1 -1
## B 1 -1
## C 0 2
Fit a model. Note the three-way interaction, involving a three-level variable. (eek! really, don’t try to run anything more complex than this)
Take t > |2| to reflect reliable effects. We observe…
a main effect of recipe–2nd contrast: this means that the C recipe is different compared to the A and B for cake breakage angle.
a main effect of temperature– 225 is different than 175 for cake breakage angle.
an interaction between temperature and time: on average, the 175 temperature has a smaller breakage angle at time 2 than time 1, but the 225 temperature does not.
an interaction between recipe–2nd contrast, temperature, and time: the difference between temperature and time (2-way interaction) differs when comparing level C to the combination of the other two levels.
## Linear mixed model fit by REML ['lmerMod']
## Formula: angle ~ recipe * temperature * time + (1 | batch)
## Data: cake2
##
## REML criterion at convergence: 575
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.43768 -0.47885 -0.05524 0.41175 2.31054
##
## Random effects:
## Groups Name Variance Std.Dev.
## batch (Intercept) 34.50 5.874
## Residual 22.93 4.788
## Number of obs: 90, groups: batch, 45
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 33.5759 1.0130 33.146
## recipe1 -0.4821 1.2406 -0.389
## recipe2 1.7723 0.7163 2.474
## temperature1 5.3824 0.5059 10.640
## time1 -1.9717 1.0130 -1.947
## recipe1:temperature1 0.6473 0.6195 1.045
## recipe2:temperature1 1.7708 0.3577 4.951
## recipe1:time1 -0.5179 1.2406 -0.417
## recipe2:time1 1.3318 0.7163 1.859
## temperature1:time1 1.2634 0.5059 2.498
## recipe1:temperature1:time1 -0.4598 0.6195 -0.742
## recipe2:temperature1:time1 1.6875 0.3577 4.718
##
## Correlation of Fixed Effects:
## (Intr) recip1 recip2 tmprt1 time1 rcp1:tmp1 rcp2:tmp1
## recipe1 0.000
## recipe2 0.000 0.000
## temperatur1 0.000 0.000 0.000
## time1 -0.067 0.000 0.000 0.000
## rcp1:tmprt1 0.000 0.000 0.000 0.000 0.000
## rcp2:tmprt1 0.000 0.000 0.000 0.000 0.000 0.000
## recipe1:tm1 0.000 -0.067 0.000 0.000 0.000 0.000 0.000
## recipe2:tm1 0.000 0.000 -0.067 0.000 0.000 0.000 0.000
## tmprtr1:tm1 0.000 0.000 0.000 -0.067 0.000 0.000 0.000
## rcp1:tmp1:1 0.000 0.000 0.000 0.000 0.000 -0.067 0.000
## rcp2:tmp1:1 0.000 0.000 0.000 0.000 0.000 0.000 -0.067
## recp1:tm1 recp2:tm1 tmp1:1 r1:1:1
## recipe1
## recipe2
## temperatur1
## time1
## rcp1:tmprt1
## rcp2:tmprt1
## recipe1:tm1
## recipe2:tm1 0.000
## tmprtr1:tm1 0.000 0.000
## rcp1:tmp1:1 0.000 0.000 0.000
## rcp2:tmp1:1 0.000 0.000 0.000 0.000
There is a three-way interaction between recipe–2nd contrast, temperature, and time. This means: the difference between temperature and time (2-way interaction) differs when comparing level C to the combination of the other two levels.
This is to say: when you look at the plot, there’s a two-way interaction visible in the A and B panels. There’s also a two-way interaction visible in the C panel– and it’s a different one. Specifically, the last yellow point goes up, rather than down.
ggplot(cake2, aes(x=time,y=angle,color=temperature))+
geom_point()+geom_smooth(aes(x=as.numeric(time)),method='lm')+facet_grid(~recipe)To learn more about mixed models: http://www.bodowinter.com/tutorial/bw_LME_tutorial2.pdf
For trouble-shooting: https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html
And feel free to use my own best-practices guide, which I have put in the folder for this course.